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Abstract 

Scoring DNA sequences against Position Weight Matrices (PWMs) is a widely 
adopted method to identify putative transcription factor binding sites. While 
common bioinformatics tools produce scores that can reflect the binding strength 
between a specific transcription factor and the DNA, these scores are not directly 
comparable between different transcription factors. Here, we provide two different 
ways to find the scaling parameter A that allows us to infer binding energy from a 
PWM score. The first approach uses a PWM and background genomic sequence 
as input to estimate A for a specific transcription factor, which we applied to 
show that A distributions for different transcription factor families correspond 
with their DNA binding properties. Our second method can reliably convert A 
between different PWMs of the same transcription factor, which allows us to 
directly compare PWMs that were generated by different approaches. These two 
approaches provide consistent and computationally efficient ways to scale PWMs 
scores and estimate transcription factor binding sites strength. 
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Content 

Introduction 

Sequence-specific transcription factors (TFs) are key elements in the regulation of 
gene expression. Their binding preferences to DNA have been studied extensively 
in vitro, in vivo and using computational methods. In vitro methods such as DNase 
I footprinting [1], protein binding microarray(PBM) [2] and high-throughput SE- 
LEX measurements [3] have provided fundamental insight into the specificity of 
TF binding. The systematic compilation of DNA sequences from such experiments 
(and along with them catalogues such as TRANSFAC [4] or JASPAR [5]) have long 
suggested that TFs do not just bind to one DNA motif, but can bind to a repertoire 
of similar sequences. Stacks of such sequences give rise to alignment matrices, in 
which each column represents the absolute count of A, C, G and T nucleotide occur¬ 
rences per position along the length of the motif. Work by Berg et al. [6] introduced a 
derivative of the alignment or position frequency matrix (PFM), the position weight 
matrix (PWM), which takes the log likelihood of observing nucleotides taking their 
overall frequency into account. Berg et al.[ 7] later showed that the score obtained by 
comparing the PWM against a DNA sequence is proportional to the binding energy 
between this TF and the DNA. In most cases the actual binding energy between 






Ma et al. 


Page 2 of 18 


the protein and DNA is not known, and the proportionality is scaled with a factor 
commonly termed A. 

There is no well-characterized and easily computable way to determine the TF 
binding energy to specific DNA sequences and to compare binding site strength 
between different types of TFs at large scale. This is problematic when scanning 
the genome with a library of PWMs, as scoring functions treat each PWM inde¬ 
pendently, and the absolute score associated with a “good match” to the PWM of 
one transcription factor might be associated with a mismatch for another factor. A 
more sophisticated application of binding site strength estimation is, for example, 
modeling the relationship between enhancer occupancy and gene expression [8, 9]. 
The experimental PBM approach [2] allows the estimation of the relative binding 
strength of a protein to “naked” DNA in vitro , but the data availability is restricted 
to a limited number of TFs due to high cost of the technology. In addition, PBMs 
are also not suitable for TFs with longer motifs, as their accuracy will decrease with 
the length of the DNA probe [2], Therefore, PWM-based approaches are used to 
computationally estimate TF binding affinity to a specific sequence [8, 10]. 

In the majority of bioinformatic studies, the scaling factor A is unknown and 
PWM scores are used at face value as measure of affinity. For example, in our own 
work [11] we used the PWM score without scaling to compare binding site strength 
across different TFs in E. coli , which might lead to a bias due to the absolute 
differences between the highest and lowest PWM scores across all TFs of interest. 
Other work has tried to assess the range of A on the basis of fitting calculated affinity 
landscapes to ChIP-seq profiles [12, 13]. However, ChIP data is intrinsically noisy 
and the height of a ChIP peak may not accurately represent the real binding affinity, 
undermining the stability and accuracy of A obtained from these methods. In Roider 
et.al [12], the estimated A for the same TF in different conditions diverged greatly 
in nearly one third of TFs. Furthermore, there is a wide band of possible A values 
that optimize the correlation. Aforementioned fitting methods are further reliant 
on chromatin accessibility data acquired under the same experimental conditions, 
which is often not available for specific conditions and TFs. 

We propose a simple approximation to estimate the scaling parameter A based on 
existing PWM matrices, average maximum mismatch energy tolerance estimated by 
high-throughput binding energy measurements [14] and the distribution of PWM 
scores across the genome of a specific organism. This method is independent of 
genome-wide binding and accessibility data. Furthermore, in the cases where there 
are potentially inconsistent PWMs for a particular TF (e.g. derived on the basis 
of individual binding sites vs. derived from high-throughput efforts), we provide a 
method to convert the known A for one PWM matrix of the same TF into another 
suitable value for a new PWM matrix. This method is based on a computational 
model of the facilitated diffusion of TFs on the DNA that our group established 
earlier [15]. We calculate sequence-specific residence times of TFs at the DNA, 
which is correlated with affinity. We can therefore derive A for different PWMs of 
the same TF on the basis of the consistency of simulated residence time. These two 
strategies (a) calculating A to scale PWM scores based on the mismatch energy 
theory using a simple equation and (b) converting the scaling parameter A between 
different PWM matrices of the same TF on the basis of simulated residence time of 
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facilitated diffusion provide simple but useful estimations of binding energy across 
different TFs using properly scaled PWM scores. 


1 Methods 

PWM matrices for TFs for yeast, fly and vertebrates 

Position frequency matrices (PFM) used to construct PWM matrices were down¬ 
loaded from the JASPAR database (JASPAR-CORE-2014 non-redundant PFM) 
[5]. Additional sources of PFM such as those contained in the BioConductor pack¬ 
age PWMEnrich.Dmelanogaster. background [16] were used as a source of different 
matrices for the same TFs. PFMs constructed with less than 30 reference sequences 
of validated binding sites were removed, as we deemed those insufficient descrip¬ 
tions of binding preference. Given that typical TF binding sites span at least six 
base pairs, we removed any motifs less than 6 base pairs in length. 

A bioinformatics approach was used to derive PWM scores [17] as follows: 

sj= i> g2 Si (i> 


where j is the DNA position for the PWM score calculation, L is the length of the 
motif and k represents k th nucleotide in the PWM motif. In addition, if there is 
a specific nucleotide in position (j+k) on the DNA, fj + k is the frequency of this 
nucleotide in the whole genome of a specific organism. We used a pseudo-count /r 
to adjust the frequency of nucleotides and obtain Vj^ to avoid zero frequency as 
follows [18] 


II j , k fj+k ' M 

J2x n *,k + n 


( 2 ) 


where /i is chosen to be 1 [18] and we also show that the choice of the pseudo-count 
H does not have significant influence on our results (Supplementary Figure 6); n x ,k 
is the frequency of certain nucleotide x in a specific position k of the motif. 

Simple equation to calculate A 

A is the scaling factor that allows for direct comparison of different PWMs in 
terms of binding energy to DNA. The binding energy of a TF to the DNA at a 
specific position can be expressed as: 


E = E 0 ■ e~ Sj,x 


( 3 ) 


where E is the binding energy, Sj is the PWM score and E 0 is a scaling parameter. 
This is useful in a variety of contexts, such as comparing the binding strength of 
different TFs. In addition, the expected amount of time that the TF is bound to a 
particular DNA sequence can be estimated as: 

T j = to (A) • e~ Sj / x (4) 

where Sj is the PWM score at position j in the genome, r 0 is the average residence 
time calculated as in [15]. This equation is widely used in simulations of TF binding 
kinetics [19]. 
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Given the utility of the A for estimating binding strength and occupancy time, it is 
very important to have a simple strategy for estimating it. We derive our equation 
based on the following core assumptions: 1. The top 0.1% of the highest scoring 
matches of the PWM to intergenic regions are considered to be possible TF binding 
sites, as suggested by [20]. Their genome-wide study of different eukaryotic TFs 
revealed an average of 1 binding site in every 1-5 thousand base pairs of intergenic 
sequence. This top 0.1% score threshold has also been similarly adopted in other 
studies [10]. 2. The maximum mismatch energy between the consensus binding motif 
and specific DNA sequences is proportional to the information content of the PWM 
matrix of the TF. 

Based on the mismatch energy theory for estimating TF binding strength [7], the 
mismatch energy at a particular binding site j of TF species i in the genome can be 
expressed as: 


^mismatch, i,j ^Sij/Xi (Smax,i 


( 5 ) 


where Si j stands for the PWM score at position j, S rnax , is for the maximum PWM 
score of TF species i and A, i is the scaling parameter we want to estimate. 

The lower boundary of potential binding sites is approximated by the top 0.1% 
of PWM scores following the same reason as mentioned before and corresponds to 
the maximum mismatch energy tolerance level as follows 


E, 


maxMismatch.i — 


Smax,i ^top0.1%,i 
Ai 


where E max Mismatch,i stands for maximum mismatch energy tolerance for TF 
species i, thus, \ can be calculated using: 


A,: = 


Sr, 


- S , 


top0.1%,i 


E, 


max Mismatch,! 


( 6 ) 


Because different transcription factors have different DNA binding domains, the 
maximum mismatch energy range can vary from one TF to another. Since there is 
only data available for 4 individual TFs using microfluidic platform-based binding 
energy measurements [14], we estimated the maximum mismatch energy for other 
TFs by using the available data as the average rate and assuming that the mismatch 
energy tolerance is proportional to the information content of the PWM matrix as 
follows: 


EmaxMismatch,i —^ ^ X 


Ifi 


<If> 


( 7 ) 


where Ifi represents the information content of a specific PWM matrix, < If > 
stands for the average information content corresponding to the average maximum 
mismatch energy measured by [14], which is 13.2 bits. 

We chose an average mismatch energy tolerance of 6 bits based on the study by 
[14]. They showed by mechanical trapping of molecular interactions a significant de¬ 
cline in binding energy by at most 2 to 3 nucleotide mismatches, and each mismatch 
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nucleotide contributes 2 bits in mismatch energy. Even if more mutations are in- 
troduced, the binding energy does not drop further since it has already reached the 
background non-specific binding energy level. Another report featuring TFs from 
different families including: p53, Max, Glucocorticoid Receptor [21] also provides 
additional support for 6 bits as average mismatch energy tolerance level. 

Estimating A of a new PWM matrix for the same TF based on the 
residence time landscape of the facilitated diffusion model 

Sometimes there may be more than one PWM available for a specific TF. In order 
to directly compare the TF binding energy when using two alternative versions of a 
PWM, it is important to have a way of scaling the results by A. A can be adjusted 
using the formalism introduced in the previous sections. As a compute-efficient 
alternative, we developed a more optimal strategy for estimating A, which does not 
require the assumption that the PWM information content influences the energy 
mismatch tolerance. Instead, we base our strategy on the estimation of the sequence 
specific residence time of a particular TF, which is a biological meaningful quantity 
and can be correlated with in vitro sequence-dependent sliding measurement of TFs 
[10]. For the same TF, the sequence-specific residence time distribution calculated 
by Equation 4 should be as consistent as possible, even when using slightly different 
PWMs, if an appropriate A is chosen for scaling. Based on this, given a known A 
for one PWM, we are able to find another suitable A for the new PWM. 

Note that the stronger the PWM score, the more likely it is that the sequence 
is bound by a TF and that the TF’s residence time is a biologically meaningful 
quantity, but there is a much greater number of weak and medium strength binding 
sites than strong sites in the genome. Therefore, if we scored each potential bind¬ 
ing site equally, the background of weak and medium-strength binding sites would 
have a greater affect on the estimated A than the strong binding sites. Therefore we 
compare residence times across different quantiles on a logarithmic binding strength 
scale, so that the strongest binding sites have the most influence on our A estimates. 
Specifically, in the following analysis, we take the logio of the cumulative distribu¬ 
tion of PWM scores and select all binding sites with values greater than 3.0 (recall 
that this corresponds to the 0.1% percent of binding sites, which were chosen as the 
lower boundary of weak binding sites). We divide these top-scoring binding sites 
into bins every 0.1 log-quantile and calculate the average residence time for each of 
these bins. 

Our strategy identifies the A that would produce the most similar residence times 
for each of these log-quantiles. Assuming that for the first PWM, we already have 
an estimate of A by either binding profile fitting or other methods, we can use 
Equation 4 to calculate the residence time for each binding strength log-quantile, 
as described above. In the following analysis of this paper, since there are very few 
well-characterized A values from profile fitting, for proof-of-principle, we borrow the 
values obtained from Equation 6 as pre-calculated A. Note that To is calculated via 
the strategy described in [15] from all intergenetic regions in the genome, which has 
a different value for each unique PWM. 

Now for the second PWM, we can vary A between the potential values of 0.1 and 3, 
which was shown to be a possible A range [12], and calculate the corresponding res¬ 
idence times at each log-quantile level. We can now compare the reference residence 
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times from the first PWM with the residence times for the second PWM across each 
binding site strength level, and for each value of A. The A that minimizes the mean 
square error between two sets of calculated residence times is chosen as the suitable 
A value for the second PWM matrix. Since outliers can have a big influence on the 
mean square error, we calculated the sum of the absolute differences for the natural 
logarithm of residence times between the two PWM matrices for these quartile bins 
(Equation 8) to make a comparison with the method that uses mean square error. 

^Z q \ lnT q,X -lnr 9iT . e/ | (8) 

where q represents each quantile in the quantile series, t 9j a is the residence time in 
a specific quantile of a particular A, T q ^ re f is the residence time in the same quantile 
of the known A of the reference PWM matrix. The A derived by these two methods 
show good consistency with adjusted coefficient of determination of 0.9644. Thus, 
there should not be significant bias using either of these two methods. 

2 Results 

Estimating scaling parameter A for binding site affinity across different 
species and TF families based on Equation 6 

The A parameter is the critical link between PWM score, the estimated binding 
energy and TF residence time. Estimating TF binding site affinity by comparing 
PWM scores at face value can lead to a large bias, especially when this includes 
comparisons between many types of TFs, because several properties of the PWM 
matrix itself can influence the PWM score. For example, the information content of 
the PWM matrices is positively correlated to the maximum possible PWM score, 
as is shown in Figure 1 with an R 2 value of 0.597. Thus, the absolute value of PWM 
scores cannot be compared directly across different TFs as an indicator of binding 
site strength. Therefore, proper scaling of PWM score is needed in order to compare 
binding site affinity across different types of TFs. Based on the methods proposed 
by Berg et al. [7], the TF binding energy for a specific binding site can be computed 
by Equation 3 using the estimated A. 

A calculated by this method are all within the range suggested by [12], which 
are listed in Table 1 for different organisms. The values for vertebrate species refer 
to all available vertebrate TFs obtained from the non-redundant PFM JASPAR 
database. The upper and lower bound of A across all organisms are quite similar, in 
the range of 0.25 to 2.83. This indicates that all eukaryotic TFs, no matter which 
organisms they belong to, all share energetically similar DNA binding mechanisms, 
since A can be interpreted as a metric for the chemical property of stickiness be¬ 
tween the TF molecule and DNA. To demonstrate the biological application of this 
parameter, Figure 2 shows an example of the D. melanogaster Even-skipped, stripe 
1 enhancer with the comparison between PWM score and the affinity estimation 
using A scaling. The usefulness of A estimates becomes apparent when comparing 
the first two binding sites indicated by blue arrows in this locus; the second bind¬ 
ing site has a higher PWM score, but its binding strength is lower than the first 
binding site once the A scaling factor is taken into account. Similar situations also 
appear in the overlapping binding site of Bicoid and Kruppel indicated by the third 
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arrow. Thus, only comparing the raw value of PWM score [11] may lead to false 
interpretations of binding site importance. When comparing binding site strength 
for different TFs, we need to use A to adjust the PWM score. 

Next, we calculated A for each TF in S. cerevisiae , D. melanogaster and available 
vertebrate TFs in JASPAR [5], which are listed in Supplemental Table 1. Figure 
3 shows the overall A distribution in each group of organisms, showing that the 
mean values of the A distribution shift from low to high from S. cerevisiae to D. 
melanogaster , and then to the vertebrates. The difference in average information 
content alone does not fully explain this discrepancy (Supplement Figure 1). Inter¬ 
estingly, there is also a shift of mean information content from low to high values 
from S. cerevisiae to vertebrates, but since the information content is the denom¬ 
inator in Equation 6, it cannot account as a direct cause of the observed trend 
in the change of A distribution. Instead, the range of PWM scores that fall into 
the top 0.1% represented in the numerator of Equation 6 mainly contribute to the 
differences of the A distribution. 

Furthermore, the comparison of A across different TF families according to the 
classification in JASPAR [5] is illustrated in Figure 4. Zinc-finger nuclear receptor 
family, the basic leucine-zipper family and helix-loop-helix family are the three 
families with highest average values of A compared with other groups with t-test 
p-values equal to 8.3 • 10 _5 ,1.4 • 10~ 4 and 6.1 • 10 -4 respectively. Homeobox and 
forkhead TF family, both of which belong to the helix-turn-helix(HTH) TF super 
family, appear to have lower average A compared with the former three families and 
no difference is detected between these two using t-test (p-value=0.98). The /J-/3- 
a zinc-finger family, the largest TF family in higher eukaryotes, shows relatively 
high A compared to the HTH super-family including the homeobox and forkhead 
sub-families with t-test p-value of 0.13, and it also has a wide range of A owing to 
the great diversity of binding motifs [22]. The high mobility group family shows no 
significant differences from HTH super-family with t-test p-value of 0.47. 

Since A is the denominator to the PWM score differences between one binding 
site and the consensus sequence in Equation 3, a larger A indicates lower mismatch 
energy when A Sj is the same. Thus, with the same possible mismatch energy range, 
if A is larger, the PWM score can have a greater range from the consensus sequence 
to the potentially weakest binding site, which indicates the binding motif for the 
TF family has higher flexibility as suggested by [23]. This is consistent with the 
fact that the TFs in the zinc-finger super-family including the nuclear receptor 
and /3-(3-a zinc-finger families are less constrained to a particular motif than HTH 
super family. Additionally, cross species comparison of A indicates that from yeast 
to vertebrate, more flexible TF motifs are used, which is consistent with the result 
from [22] that organisms which appeared more recently in evolution tend to use 
more TFs with motifs of higher flexibility. 

Converting A between different PWM matrices of the same TF 

In many cases there are two PWMs available for the same TF, and one of these 
PWMs might already have a reliable estimate of A, from any number of experimental 
or computational approaches [13]. In such circumstances, we provide a strategy to 
estimate the unknown A associated with the alternative PWM matrix. It would be 
possible to calculate the unknown A from Equation 6, but this does not incorporate 
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the additional data available (i.e. the known A). Our alternative strategy not only 
incorporates this data, but also loosens the assumption in Equation 6 that the 
maximum mismatch energy for DNA binding is proportional to information content. 

The procedure to compute a suitable A is based on the concept of sequence-specific 
residence time (Equation 4), as illustrated in Figure 5. Initially, a well-characterized 
A is computed or measured for the first PWM of a particular TF, and then we 
use this value to derive a A that is appropriate for the second PWM of the same 
TF. As part of the calculation of the A for the second PWM, Figure 5C shows a 
heatmap of the estimated residence times for a TF named lame duck (lmd) in a 
particular binding strength quantile, at different values of A (ranging from 0.1 to 
3.0 as suggested by both [12] and the range of estimated A using Equation 6 across 
different organisms). Both PWMs for the TF come from FlyFactorSurvey database 
[24], but they are derived from different reports with motif logos shown in Figure 5B. 
Blank regions in the heatmap indicate the choice of A would generate a residence 
time outside the range of pre-calculated possible residence times using the first 
PWM and the existing A value, implying that these A values for the second PWM 
are unsuitable. As shown in the heatmap, blank regions often appear in very low 
values of A, while if A is too large, the possible residence time range from weak to 
strong binding sites is often very restricted, which means high affinity sites cannot 
be distinguished from low affinity sites efficiently. A values with residence times all 
within the reference range can be further selected, as specified in Methods. Figure 
5D-F compares the residence time values between two different PWMs, at different 
values of A for the second PWM. We see that the A in Figure 5D and 5F would not 
allow for consistent residence times between the two PWMs, but Figure 5E does 
provide consistent results. Therefore, the A adopted in Figure 5E is picked up as 
the suitable value for the second PWM matrix. More examples of residence time 
heatmaps for converting A between different PWMs are shown in Supplementary 
Figure 3. 

In order to evaluate the consistency of A estimation between the above method and 
using Equation 6, we use the examples of 20 D.melanogaster TFs with more than 1 
version of PWMs available from different experiments. These PWMs are obtained 
from the BioConductor R package PWMEnrich.Dmelanogaster.background [16] and 
their labels are listed in Supplementary Table 2. Since there are only few A available 
from binding profile fitting, just for the purpose of illustration, the reference values 
of A were pre-calculated from Equation 6 instead. New A values for PWMs obtained 
from other experiments are computed using both methods and they show good 
consistency with each other with adjusted R 2 equals to 0.88 (Supplementary Figure 
4). Converting A between these two PWMs in the opposite direction also show 
similar results (data not shown). It indicates that both methods provide consistent 
estimates of A, even though they have different core assumptions. 

3 Discussion 

TF binding site strength estimation using PWM-based methods is essential for mod¬ 
elling TF-DNA interaction in functional genomics; but a proper scaling parameter 
is needed when using the PWM score to estimate TF binding energy. Therefore, 
we provide two independent methods for estimating the scaling parameter A in 
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different conditions. The simple Equation 6 is widely applicable, since it only re¬ 
quires a PWM matrix as input, which is easy to implement compared to methods 
using fitting to ChIP-seq [12, 13]. Our second method converts a A specific to one 
PWM into A for a different PWM of the same TF. It is based on the definition of 
sequence-specific residence time from the facilitated diffusion model of TFs on DNA 
[15]. This method is particularly useful for converting a previously estimated A into 
the one associated with a more up-to-date or otherwise alternative PWM matrix. 

These two methods are consistent with one another (Supplementary Figure 4) 
and with previously established methods. For instance, Equation 6 can also provide 
very similar results compared with the estimated A from ChIP-seq data fitting for 
the 5 D. melanogaster TFs in [13], which is shown in Supplementary Figure 5. The 
consistent value range of A in different organisms calculated by this method pro¬ 
vides additional support for the applicability of this simple equation. Moreover, the 
estimated distribution of A values for different TF families make sense in the light 
of motif choice for each TF families [25]. For example, TFs in the zinc-finger TF 
super-family including nuclear receptor zinc-finger and j3-j3-a zinc-finger families 
have more flexible binding motifs, which can suit a wider range of possible binding 
sites than the helix-turn-helix super-family, which has a more restricted motif con¬ 
sensus sequence [26]. Inside the same super-family, because of differentiated DNA 
binding domains and functions, nuclear receptor zinc-finger and /3-/3-a zinc-finger 
families also show significant differences of A distribution, with t-test p-value of 
0.006. However, some TF families belong to the same super-family and also share 
similar binding domain properties can have strong similarity in A distribution, e.g. 
homeobox family and forkhead family which both belong to the helix-turn-helix 
super-family. 

There are some points that should be noted when using the simple equation 
method: first, it cannot be applied to very short TF motifs that are less than 6 base 
pairs in length. Since this method depends on calculating the difference between the 
top 0.1% of PWM scores and the maximum score, if the motif is only 5 base pairs 
in length, the number of possible choices for sequence combination of 5 base pairs 
is only 1024, then the top 0.1% of PWM scores is the top score. However, most 
eukaryotic motifs are more than 6 base pairs long. Eukaryotic TFs on average cover 
15 bp of DNA with a core motif length of 8-15 bp [8]. Thus, this limitation should 
not be a problem in the majority of cases. Another assumption in this method is 
that the mismatch energy tolerance range measured in bits is proportional to the 
information content of the PWM matrix. This assumption can deal with the bias 
from the differences in information content of most PWM matrices; however, it 
might not hold for PWM matrices with extremely high information content. For 
example, the yeast transcription factor IXR1 has an information content of 47 bits 
according to the PFM matrix from JASPAR [5], which is substantially larger than 
the average information content of 13.2 bits. In that case, the binding energy will 
probably be overestimated, which leads to a lower A, but these cases are very rare 
and only 7 PWM matrices in our analysis (less than 1.5%) have information content 
greater than 20. The mean of information content and the mean of estimated A 
for each TF families are listed in Supplementary Table 3, and they show weak 
positive correlation with p-value of 0.056 (full distribution of information content 
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across different organisms and in different TF families are shown in Supplementary 
Figure 3 and 4). Interestingly, the information content is actually the denominator 
in Equation 7 to calculate A which means the information content does not affect 
the results of A directly. It is the mismatch energy tolerance differences between 
different TF families that contributes to the variation in A distribution. 

There are two limitations of this method which can potentially lead to some bi¬ 
ases between different organisms and different TF families. One limitation is related 
to the calculation of mismatch energy tolerance in different groups of TF families. 
We apply a single cut-off threshold of top 0.1% PWM score for weak binding sites 
suggested by [20], but it could be possible for different TF families, different thresh¬ 
old should be used due to variations in their DNA binding domains. However, it 
is difficult to choose specific thresholds for every TF family based on the currently 
available data. Another limitation is the assumption that mismatch energy toler¬ 
ance of a TF is proportional to the information content of the PWM matrix. It is 
possible that such relationship is not linear but more complicated, which is difficult 
to verify. Further, from the definition of information content of the PWM matrix, 
it sums up information content gain from each nucleotide [19], which implies longer 
motifs including more flanking base-pairs will have slightly higher information con¬ 
tent compared to the shorter ones just with core motifs, which is an artefact of 
computation. However, there is no satisfactory way to deal with this problem. One 
possible solution is using the information content per nucleotide instead of the to¬ 
tal information content, but this may be even more detrimental as the information 
content contributed by flanking sequences constitutes only a very small fraction 
compared to core motifs. Thus, if dividing total information content by the length 
of the motif, the dilution of information content can lead to even bigger biases. An¬ 
other potential solution is trying to define a core motif from one PWM matrix, but 
this requires detailed knowledge about the TF of interest. Additionally, A will not 
be a reliable measure of biochemical stickiness of the TF to the DNA if the PWM 
itself is not an accurate representation of TF binding. A PWM assumes that each 
nucleotide position independently contributes to TF binding affinity, which may not 
be the case [27, 28] . In addition, the composition of the position frequency matrix 
of the PWM may contain biases due to the difficulties of attaining an unbiased 
validated binding site set. Nevertheless, A can give us insights about DNA binding 
properties of TFs. 

Also, it should be pointed out that residence time in this paper refers to an 
estimate based on the biophysical model proposed in [10, 15]. However, other pa¬ 
pers report inconsistent scales of residence time according to different experimental 
approaches. For example, the residence time estimations obtained by Competition- 
ChlP methods [29] do not share the same order of magnitude compared to the 
residence times measured by FRAP or single molecular tracking [30, 21, 31], which 
can probably be an artefact of experimental methods or alternatively, the range 
of residence time truly varies greatly across different TFs [32]. Because the exper¬ 
imentally determined values are not comparable to each other, we simply adopt 
bioinformatics-based approaches to compute residence time. Since our method con¬ 
verts A between different PWM matrices of the same TF under the concept of 
residence time, it avoids to fit inconsistent experimental observations and potential 
variations in DNA-binding kinetics for different TFs. 
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Although in many cases PWMs are not optimal representations of binding mo¬ 
tifs, they have become almost universally adopted to identify potential TF binding 
sites. However, it is important to remember that the value of a PWM score is not 
directly correlated to the binding energy, but rather depends on the scaling pa¬ 
rameter A. Previously, researchers either assumed that A has similar values across 
different PWMs or estimated it through computational intensive binding profile fit¬ 
ting methods [12, 13]. Here we provide two simple strategies for estimating A, which 
will let us more clearly link PWM scores with the energetics of TF binding. 
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Figures 



Figure 1: The relationship between maximum PWM score and information conten, 
of PWM matrices. Individual dots represents each PWM matrix generated from 
the non-redundant PFM JASPAR-CORE database [5] after the filtering procedures 
specified in the Methods section. There is a strong positive correlation between the 
information content of the PWM and the maximum possible PWM score that could 
be generated by that PWM, with an R 2 value of 0.597. 
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Figure 2: A comparison between PWM score and binding site strength in the I) 
melanogaster even-skipped stripe 1 enhancer. The even-skipped stripe 1 enhance: 
on chromosome 2R is dense with binding sites. We compare the raw PWM score: 
(circles) and the A-scaled binding strength (height of the bars) for each of thesi 
binding sites, colour-coded by the type of TF. Based on raw PWM scores, on 
might assume that the Caudal site indicated by the first blue arrow would have 
lower binding strength than the Kruppel site indicated by the second blue arra 
however, once the binding strength is scaled by A using Equation 5, it become; 
evident that the opposite is the more likely scenario. The third arrow points to 
location where a Kruppel and a Bicoid binding site overlap. Here, the A adjust 1 
binding strength estimates would suggest that Bicoid binding site is stronger, whil 
a raw PWM score would suggest the opposite. These results illustrate how usin; 
raw PWM scores may result in biased interpretation of the relative binding streng' 
of TFs. 
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Table 1: Maximum, minimum and the average values of A in 3 groups of organisms. 



S. cerevisiae 

D. melanogaster 

Vertebrates 

maximum 

2.83 

2.72 

2.82 

minimum 

0.26 

0.35 

0.25 

mean 

1.25 

1.40 

1.73 
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Figure 3: A distributions across 
values estimated from Equation 
cerevisiae (A), D. melanogaster 


difference organisms. The histograms depict the A 
6 for the JASPAR non-redundant core motifs in $ 
(B) and available vertebrates (C) [5]. 
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Figure 4: A distribution comparison across major TF families. BBA-ZF represeni 
the A distribution for f3-/3-a zinc-finger family; NR is zinc-finger nuclear recepto: 
family; L-zipper stands for the basic leucine-zipper family; HLH is helix-loop-heli: 
family; Homeo is homeobox family; FK is fork-head family and HMG is high mi 
bility group family. For each group, A was calculated by Equation 6. 
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Figure 5: Conversion of A between two PWM matrices for the Imd, transcriptic 
factor. The flow chart shows the procedure to obtain an optimised A, given tw 
different PWMs and one known and one unknown A (A). Subfigure B illustrate;; 
the two alternate PWMs for lmd which are available in the FLYFACTORSURVE 
database [24]. Equation 6 suggests that PWM1 has a A of 1.6, and we are tryin; 
to find a A for PWM2. Subfigure C is a heatmap of the residence time distributio: 
of PWM2 for different values of A and different binding site strength level. Eai 
column of the heatmap represents a specific A value and each row represents 
specific binding site strength level measured by the —log w of the correspond^; 
top quantiles from low affinity to high affinity sites. Blank regions in the heatma 
indicates A values will lead to residence time out of the reference scale which is a: 
indication of unsuitable A values. D, E and F show the correlation of residence turn 
between PWM1 and PWM2 using specific A values of 0.8, 1.4 and 2.0, respectively. 
The curve in subfigure E has the lowest mean square error, and so we assign PWM 
to have a A = 1.4. 
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Supplementary Figure 1. Information content across different organisms. The his¬ 
tograms describe the information content distribution for S. cerevisiae (A), D. 
melanogaster (B) and the available vertebrates’s motifs (C) from JASPAR noil- 
redundant core motifs [5]. 



Supplementary Figure 2. Information content comparison across major TF families. 
BBA-ZF represents the A distribution for fj-fl-a. zinc-finger family; NR is zinc-finge: 
nuclear receptor family; L-zipper stands for the basic Leucine-zipper family; HL 
is helix-loop-helix family; Homeo is homeobox family; FK is forkhead family ani 
HMG is high mobility group family. 
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Supplementary Figure 3. Hecitmaps for A conversion between different PWMs. These 
are additional examples of heatmaps of sequence-specific residence time that are 
used for A conversion between different PWM matrices of the same TF. Alter¬ 
native versions of PWM matrices are from BioConductor R package of PWMEn 
rich. Dmelanogaster, background [16]. Each column of the heatmaps represents a spe¬ 
cific A value and each row represents a specific binding site strength level. 
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Supplementary Figure 4. Consistency of A estimation between two methods. This 
figure shows the correlation between A values obtained from Equation 6 and fror 
A conversion using the heatmap of sequence-specific residence time. The adjusted 
R 2 is 0.88. 
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Supplementary Figure 5. Comparison between A values calculated by Equation 6 and 
fitting ChIP-seq profile. This figure depicts the A comparison between Equation 6 
and previously established method using binding profile fitting for 5 D.melanogaster 
TFs [13]. GT, HB, BCD, CAD and KR are the abbreviations for Giant, Hunchback 
Bicoid, Caudal and Kruppel respectively. The adjusted R 2 is 0.64. 
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Supplementary Figure 6. Comparison of X values calculated by using different 
pseudo-count values in PWM matrices. Subfigure A shows the comparison between 
the A values obtained by using PWM matrices with pseudocounts of 1 and 3 (the 
adjusted R 2 is 0.973), while subfigure B compares pseudocounts of 1 and 0.3 (the 
justed R 2 is 0.978). Each dot represents a TF from 100 randomly chosen vertebrab 
TFs in JASPAR database [5]. 
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